Teachers : Prof. M.-O. Boldi - Prof. M. Baumgartner Fall Semester 2022

Introduction

We decided to analyze a TV series for this text mining project: in particular, one of the most successful comedies of recent years, The Big Bang Theory. We looked at the scripts of the 10 seasons of this series to make a detailed analysis and, without having seen the series, try to understand the general framework that emerges. Our objective is to produce a detailed report based on an original database.

Overview

Data set web-scraped from https://bigbangtrans.wordpress.com/.

Project objectives

Our goal is to use the relevant text mining and machine learning tools, with both supervised and unsupervised learning methods, to characterize our data frame. In our case, this means understanding the framing of the TV show through sentiment analysis, vocabulary richness, and topic analysis.

Structure of the report

  • Introduction
  • Part 1 : Data preparation
  • Part 2 : Exploratory Data Analysis (EDA)
  • Part 3 : Analysis
    • 3.1 Sentiment Analysis by season
    • 3.2 Sentiment Analysis by character
    • 3.3 Similarities and Topic Modelling
  • Part 4 : Machine Learning
    • Supervised learning analysis
    • Unsupervised learning analysis
  • Conclusion

Part 1 : Data preparation, overview of the data set used.

Web-scraping

The data we are using in this project comes from the website “Big Bang Theory Transcripts”. It is accessible at the following link : https://bigbangtrans.wordpress.com/

We decided to web-scrape the data and to create csv files to store it. Indeed, it is easier for us to have the data locally in our files, so that whenever we want to work with it again, we do not need to scrape again: we can directly use the files we created.
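The scraping step can be sketched as follows. This is a minimal, self-contained Python illustration (the project itself was done in R): it parses a toy HTML snippet shaped like a transcript page, assuming the dialogue sits in `<p>` tags, and writes one csv row per episode.

```python
import csv
import io
from html.parser import HTMLParser

# Toy transcript page standing in for one episode page of
# https://bigbangtrans.wordpress.com/ (illustrative structure only).
SAMPLE_HTML = """
<div class="entrytext">
<p>Sheldon: So if a photon is directed through a plane with two slits...</p>
<p>Leonard: Agreed, what's your point?</p>
</div>
"""

class ScriptParser(HTMLParser):
    """Collect the text of every <p> element."""
    def __init__(self):
        super().__init__()
        self.in_p = False
        self.lines = []
    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.lines.append("")
    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False
    def handle_data(self, data):
        if self.in_p:
            self.lines[-1] += data

parser = ScriptParser()
parser.feed(SAMPLE_HTML)

# Store the whole episode as one row: the script lines are joined with
# a newline so the full script fits in a single csv cell.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["document", "title", "script"])
writer.writerow(["1", "Series 1 Episode 1", "\n".join(parser.lines)])
print(buffer.getvalue())
```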

Description of the data sets we created.

We created several different files, because we wanted to run several analyses.

“series_scripts.csv”

The first csv file is called “series_scripts.csv”. It is available in the “data” folder of our project. This data set contains 231 rows and 3 columns:

  • Each row represents the script of one episode. Over the 10 seasons of the series, there are a total of 231 episodes.
  • The column document : the index, telling us which document it is. Stored in character class.
  • The column title : the title of the script, giving the series and episode number. Stored in character class.
  • The column script : the whole script of the corresponding episode. Note that we use the symbol ‘\n’ as the end-of-line marker, which lets us fit the whole script in one cell. Stored in character class.

We do not show what the data look like here, simply because the script column contains a lot of text and it would take far too much space in the report. You are invited to open the csv files if you want an overview.

“season_scripts.csv”

We created a second csv file named “season_scripts.csv”. Indeed, we quickly realized that an analysis per episode would become tedious and less meaningful. Therefore, we decided to aggregate the episodes by season. This way we get all the scripts of each season’s episodes in one concatenated string.

This data set contains 10 rows, each row representing one season, and 2 columns:

  • The column season : indicates the season number.
  • The column agg_script : contains all the aggregated scripts per season, meaning that the scripts of all episodes of one season sit in one cell of the dataframe.

Since the table is quite long and even a single row is very long to print, we did not add the data frame output as an annex. We recommend opening the csv file directly if an overview of the data set is needed.
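The aggregation step can be sketched in a few lines; a minimal Python illustration (the project itself used R), assuming the season number can be parsed from the title column of series_scripts.csv:

```python
from collections import defaultdict

# Toy episode-level rows (title, script) standing in for series_scripts.csv.
episodes = [
    ("Series 1 Episode 01", "Sheldon: Hello."),
    ("Series 1 Episode 02", "Leonard: Hi."),
    ("Series 2 Episode 01", "Penny: Hey."),
]

def season_of(title):
    # "Series 1 Episode 02" -> 1
    return int(title.split()[1])

agg = defaultdict(list)
for title, script in episodes:
    agg[season_of(title)].append(script)

# One row per season, with all episode scripts concatenated.
season_scripts = {s: "\n".join(parts) for s, parts in sorted(agg.items())}
print(season_scripts[1])
```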

The various files ‘character_speech’

We created 10 other files based on the web-scraping, one file per season, because we wanted one new row each time a character speaks.

It means that each file has a different length, depending on how many times the interlocutor changes. However, each file has the following column structure:

  • One column season : indicates the season number the script line refers to.
  • One column main_character_script : the whole script line, including the name of the person speaking.
  • One column character_name : the name of the person speaking the line.
  • One column character_scripts : the script line of the interlocutor.

Then we combined all these rows into one main file named ‘character_speech.csv’. The first two rows are printed below to give an overview of this data set.

The first 2 rows of the data set character_speech.csv:

  • Row 3 — season: 1 ; character_name: Sheldon ; main_character_script: “Sheldon: So if a photon is directed through a plane with two slits in it and either slit is observed it will not go through both slits. If it’s unobserved it will, however, if it’s observed after it’s left the plane but before it hits its target, it will not have gone through both slits.” ; character_scripts: the same line without the speaker’s name.
  • Row 4 — season: 1 ; character_name: Leonard ; main_character_script: “Leonard: Agreed, what’s your point?” ; character_scripts: “Agreed, what’s your point?”

Part 2 : Exploratory Data Analysis (EDA)

In this part, we perform an Exploratory Data Analysis: we clean the data and derive some first results from our data sets.

EDA for the seasons analysis

Tokenization and cleaning of the data

Since we want to conduct a sentiment analysis on the seasons of the series, we first define the corpus for our analysis. The corpus is the agg_script column of our season_scripts data set, as it contains all the texts to be analysed. Once defined, we clean the texts by removing numbers, punctuation, symbols, separators, and English stop words.

In our scripts, the main characters are Sheldon, Leonard, Penny, Howard and Raj. They appear in every season’s episodes. Therefore, we remove their names for the following exploratory data analysis, since they would otherwise bias our results.

We created a variable ‘characters_names’ grouping the names of all the recurring characters of the series. These are, logically, the words that come up most often, and they are not the purpose of our analysis here. The same holds for the words grouped in the variable ‘words_to_remove’, which are linked to stage directions or do not bring any added value.

Here, we also perform lemmatisation, which removes inflectional endings and keeps only the base, dictionary form of each word: its lemma.
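The cleaning pipeline can be illustrated with a short Python sketch (the project used quanteda in R). The stop word list, the character names and the lemma dictionary below are tiny, hand-made stand-ins for the real resources:

```python
import re

STOPWORDS = {"the", "a", "is", "to", "so", "if", "it"}
CHARACTER_NAMES = {"sheldon", "leonard", "penny", "howard", "raj"}
# Toy lemma dictionary; the real analysis uses a full lemmatiser.
LEMMAS = {"directed": "direct", "slits": "slit", "goes": "go"}

def clean_tokens(text):
    # Lowercase and keep only alphabetic tokens: this drops numbers,
    # punctuation, symbols and separators in one pass.
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens
              if t not in STOPWORDS and t not in CHARACTER_NAMES]
    # Lemmatisation: replace each token by its dictionary base form.
    return [LEMMAS.get(t, t) for t in tokens]

print(clean_tokens("Sheldon: So if a photon is directed through 2 slits..."))
# → ['photon', 'direct', 'through', 'slit']
```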

The Document-Term Matrix, the TF-IDF and the global frequencies

We use the TF-IDF method to look at the specificity of terms across seasons. It weights terms by their frequency within a document while down-weighting terms that appear in every document, so that the score reflects the relevance of a term in the corpus. With the following graph, we can see that the most frequent term is, by far, ‘time’. Indeed, the frequency matrix shows that ‘time’ appears more than 900 times in the whole show.

We observe that the most frequent terms belong to the lexical field of feelings, for instance ‘love’, ‘feel’ and ‘fine’. From this first representation alone, we can already suspect a positive pattern in the show.
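The TF-IDF weighting described above can be sketched in a few lines of Python on toy season vocabularies (illustrative only; the real analysis used quanteda):

```python
import math
from collections import Counter

# Toy tokenised season scripts.
docs = {
    "season1": ["time", "guy", "gablehouser", "time"],
    "season2": ["time", "love", "guy"],
    "season3": ["time", "feel", "love"],
}

n_docs = len(docs)
df = Counter()                      # document frequency of each term
for tokens in docs.values():
    df.update(set(tokens))

def tf_idf(doc):
    # Classic weighting: raw term frequency times log(N / df).
    tf = Counter(docs[doc])
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf}

w = tf_idf("season1")
# "time" appears in every season, so its idf (hence tf-idf) is 0,
# while "gablehouser" is specific to season 1 and gets a high weight.
assert w["time"] == 0.0
assert w["gablehouser"] > w["guy"] > 0
```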

Cloud of words

The bigger the word, the more frequently it appears. The most frequent words seem to be “time”, “guy”, “love”, “feel”, etc. Does this mean that the seasons carry mostly positive sentiments? We will analyse that throughout our report.

The 10 most frequent words per season

Then, we plotted the 10 most frequent words per season and observed that it did not bring much information, so we do not show the graph here. It is not really relevant because ‘time’ and ‘guy’ are predominant throughout the corpus, so the plot tells the same story for each season.

The TF-IDF per season

Again, here we plot the 10 most specific words per season. In season 1 the term ‘gablehouser’ is very specific, while in season 4 the most specific term is ‘sheldon_bot’; we could imagine that they were trying to build a robot version of Sheldon. In season 10 the verb ‘born’ is very specific, so we can guess that an important event happened there.

Representation of the term frequencies

This plot is not very readable, as there are a lot of terms, but we can see that “time” and “guy” are indeed very frequent, as seen previously, and that “gablehouser” and “sheldon-bot” are each very specific to one particular season. For “gablehouser”, it is season 1: after some research, it turns out to be a character, Dr. Eric Gablehauser, who does not appear in the show after the first 2 seasons, which is why we had not noticed him earlier and removed him with the other characters’ names.

Lexical diversity

We compute the lexical diversity of the scripts and see that the seasons, especially the later ones (7, 8, 9, 10), do not have a very diverse lexicon. Indeed, their TTR is around 0.25, which means that on average there is only one distinct word type for every four word tokens. The season with the richest lexicon seems to be season 2, with a TTR of almost 0.5.
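The type-token ratio is simple enough to show directly; a Python sketch with toy tokens:

```python
def type_token_ratio(tokens):
    """Number of distinct word types divided by total token count."""
    return len(set(tokens)) / len(tokens)

# 8 tokens but only 2 distinct types -> TTR = 0.25: on average each
# type is repeated four times.
tokens = ["time", "guy", "time", "guy", "time", "guy", "time", "guy"]
print(type_token_ratio(tokens))  # → 0.25
```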

Keyness analysis

We made a keyness analysis to understand how the frequency of terms in a target document compares to their frequency in the rest of the corpus.

  • In the first graph, we analyse season 5 as the target. The reference is the rest of the corpus, and we observe that the word ‘siri’ is the most over-used in the script of season 5 compared to the rest.

  • In the second graph, we analyse season 7 as the target. The reference is the rest of the corpus, and we observe that the word ‘element’ is the most over-used in the script of season 7 compared to the rest.
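A simple way to see the idea behind keyness is to compare a term’s relative frequency in the target against the reference. The Python sketch below uses a smoothed frequency ratio as a stand-in; quanteda’s textstat_keyness reports a chi-squared statistic instead, but the ranking idea is the same:

```python
from collections import Counter

target = ["siri", "siri", "time", "guy", "siri"]            # e.g. season 5
reference = ["time", "guy", "time", "talk", "guy", "time"]  # rest of corpus

def keyness_ratio(term, target, reference, smooth=0.5):
    """Relative frequency in the target over relative frequency in the
    reference, with add-0.5 smoothing so unseen terms do not divide by 0."""
    t, r = Counter(target), Counter(reference)
    p_t = (t[term] + smooth) / (len(target) + smooth)
    p_r = (r[term] + smooth) / (len(reference) + smooth)
    return p_t / p_r

# 'siri' is far more typical of the target than 'time' is.
assert keyness_ratio("siri", target, reference) > keyness_ratio("time", target, reference)
```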

Co-occurence analysis

Next, we decided to create a co-occurrence matrix to get an overview of which words often appear together.

Because the matrix is very difficult to read when there are too many words, we restricted it to terms that co-occur more than 300 times.

The matrix representation helps us understand how many times the most frequent words co-occur in the corpus. For example, ‘time’ and ‘guy’ co-occur 138’402 times, which suggests that they are often used in the same context in the script.

Below is a plot of the co-occurrence graph. Each connection means that the two words appear together more than 30’000 times. At the center we have the terms ‘time’, ‘talk’ and ‘guy’, which means that these words often appear along with the others.
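Document-level co-occurrence counting can be sketched as follows (toy documents; in R the same matrix can be built with quanteda’s fcm(), which can also use a sliding window):

```python
from collections import Counter
from itertools import combinations

# Toy tokenised documents.
docs = [
    ["time", "guy", "talk"],
    ["time", "guy"],
    ["time", "talk"],
]

# Two terms co-occur when they appear in the same document; pairs are
# stored with the terms sorted so (a, b) and (b, a) count together.
cooc = Counter()
for tokens in docs:
    for a, b in combinations(sorted(set(tokens)), 2):
        cooc[(a, b)] += 1

print(cooc[("guy", "time")])   # → 2
print(cooc[("talk", "time")])  # → 2
```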

EDA for the character analysis

Tokenization and cleaning of the data

As with the season data, we tokenized and pre-processed the data: we removed all numbers, punctuation, symbols, separators and stopwords, as well as the same vector of characters’ names and the vector of words we judged uninsightful for our analysis. Then we performed lemmatisation.

The 10 most frequent words per character

We show here the most used words per character. We see that Leonard often uses the word ‘love’ and Penny often uses the word ‘fine’. A first idea could be that Leonard is a very positive person, if the word ‘love’ is often pronounced by him.

The TF-IDF per character

Next, we want an idea of which word is specific to which character, so we plot the 10 most specific terms per character. Interestingly, the terms are very similar across characters, but their allocation differs slightly. From this plot, the word ‘remarkable’ is quite specific to Howard, while the terms ‘lord’ and ‘beverage’ are quite specific to Leonard. Penny may have a verbal tic, ‘hee’, as it is very specific to her.

Representation of the term frequencies

This plot confirms our first insight. As already seen in the season analysis, terms such as ‘time’, ‘guy’ and ‘talk’ are very frequent throughout all seasons, and here they are present for every character, so they are not specific to anyone. On the contrary, words such as ‘hee’ and ‘remarkable’ are very specific to a single character.

Lexical diversity

We compute the lexical diversity of each character. Raj seems to have the most diverse lexicon, with a Type-Token Ratio of a bit more than 0.3; indeed, in the series he sometimes even speaks Hindi. Surprisingly, Sheldon has the least diverse vocabulary, with a TTR below 0.25. We were expecting more, since he seems to be the best-known character of the series.

For the character EDA, we did not judge it relevant to perform a co-occurrence analysis, as the script is the same whether it is split per season or per character.

Part 3 : Analysis

Part 3.1 Sentiment Analysis by season

Sentiment Analysis

In this part, we want to compute the sentiment of each season. To do so, we first use the ‘NRC’ dictionary to perform the analysis. We then compare our results to another analysis using the ‘AFINN’ dictionary.

NRC dictionary

The NRC dictionary contains a list of English words and their associations with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (positive and negative). Based on this dictionary, one English word can be associated with several emotions. For example, the following table shows that the term ‘abandon’ is associated with several of them (fear, negative, sadness).

For each token in our script data, we join the corresponding sentiment qualifier in “nrc” using the inner_join() function from dplyr. Below, you can see the first 10 rows of the dictionary.
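The inner-join step amounts to keeping only the tokens that appear in the lexicon, one row per (word, sentiment) pair. A Python sketch, with a three-word excerpt standing in for the NRC lexicon:

```python
from collections import Counter

# Tiny excerpt standing in for NRC: one word can map to several
# emotions/sentiments.
nrc = {
    "abandon": ["fear", "negative", "sadness"],
    "love": ["joy", "positive", "trust"],
    "happy": ["joy", "positive"],
}

tokens = ["love", "happy", "abandon", "photon"]

# Equivalent of dplyr's inner_join(): tokens absent from the lexicon
# (here "photon") are simply dropped.
joined = [(t, s) for t in tokens for s in nrc.get(t, [])]

sentiment_counts = Counter(s for _, s in joined)
print(sentiment_counts["positive"])  # → 2
```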

Sentiment Analysis

To compare the documents (one per season), we rescale them by their length, i.e. the frequencies of sentiments are computed per document. After rescaling, we see that all seasons follow the same pattern: they are mainly positive and reflect the sentiment of trust, while very few disgust words appear. The pattern is recurrent, which fits the consistency of the characters’ humour and mood; also, if the comedy works well, the writers will follow the same formula season after season. The feeling of trust is also well represented, which might reflect the friendship aspect: indeed, it is the main component of the show.

AFINN dictionary

In this part, we use the AFINN dictionary. This dictionary contains a list of English words manually rated for valence, by Finn Årup Nielsen, with an integer between -5 (very negative) and +5 (very positive).

We see in the table below that the word ‘abandoned’ has a value of -2, which means that it is relatively negative.

The first 6 rows of the AFINN dictionary
word value
abandon -2
abandoned -2
abandons -2
abducted -2
abduction -2
abductions -2
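Scoring a document with AFINN then reduces to averaging (or summing) the valences of the matched tokens; a Python sketch with a tiny lexicon excerpt:

```python
# Tiny excerpt of the AFINN lexicon: integer valence from -5 to +5.
afinn = {"abandon": -2, "abandoned": -2, "love": 3, "great": 3, "fine": 2}

def afinn_score(tokens):
    """Mean valence of the tokens found in the lexicon; a negative
    score marks a relatively negative document. Unmatched tokens
    (e.g. "photon") are ignored."""
    scores = [afinn[t] for t in tokens if t in afinn]
    return sum(scores) / len(scores) if scores else 0.0

print(afinn_score(["love", "abandoned", "photon"]))  # → 0.5
print(afinn_score(["abandon", "abandoned"]))         # → -2.0
```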

We see in the table below that seasons 3, 2 and 7 are categorized as relatively negative, with season 3 being the most negative.

From this analysis we observe that the second half of the show, except for season 7, has a more positive score. We can suppose that the characters tend to become more positive, and maybe less sarcastic, as the series goes on.

Quanteda analysis

Using another dictionary, named ‘data_dictionary_LSD2015’, we obtain pretty much the same picture.

Valence-Shifters analysis

Valence shifters are words that alter or intensify the meaning of the polarized words and include negators and amplifiers.

Each valence shifter is coded with a number key:

  • Negator : 1
  • Amplifier (intensifier) : 2
  • De-amplifier (downtoner) : 3
  • Adversative conjunction : 4

With this density graph, we can see the average sentiment over the whole seasons once valence shifters are taken into account. On average it tends to be above 0, with a mean around 0.8.

Valence shifter approach on each season

The analysis can be done sentence by sentence. Looking at season 9 for instance, the end of the show looks very positive, with an increasing trend. The pattern sometimes shows high peaks, as in season 4, and sometimes very low ones, but it remains well distributed. Indeed, if we look at season 10, we have 4 drops into the negative, appearing roughly every 1000 sentences.

With this barplot, we can once again identify the scaled average sentiment and see that the most positive season is 10 and the least positive is season 2. The main difference is that, with valence shifters, season 10 is far more positive than in the previous analysis, which shows how important it is to take this aspect into account.

Part 3.3.1 : Topic Modelling

Similarities between season scripts

In this part, we want to compare the similarities of the scripts between seasons. We used three similarity measures to compute the similarity matrix: Jaccard similarity, cosine distance, and Euclidean distance. The most explanatory method in our case seems to be the Euclidean one, so we concentrate on it in the following.

From the Euclidean distance matrix plot, it seems that season 4 is close to every other season. On the contrary, seasons 6 and 2 are more distant (but we know that there may be a problem in the web-scraping).
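The three measures can be sketched on toy term-count vectors; a minimal Python illustration:

```python
import math

def jaccard(a, b):
    """Overlap of the vocabularies (sets of terms)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def _norm(u):
    return math.sqrt(sum(x * x for x in u))

def cosine(u, v):
    """Cosine similarity of two count vectors."""
    return sum(x * y for x, y in zip(u, v)) / (_norm(u) * _norm(v))

def euclidean(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

# Toy term-count vectors over the vocabulary ("time", "guy", "siri").
season_a = [10, 5, 0]
season_b = [9, 6, 0]
season_c = [2, 1, 8]

# a and b share the same profile, c is the outlier.
assert euclidean(season_a, season_b) < euclidean(season_a, season_c)
assert cosine(season_a, season_b) > cosine(season_a, season_c)
```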

Clustering

Then, to create a cluster, we decide to focus on the Euclidean distance only.

##       Clust.1     Clust.2     Clust.3
## 1 gablehouser        siri sheldon-bot
## 2         hee switzerland      latham
## 3      no-one         bom       glenn
## 4         leo       flags        todd
## 5        halo     crawley       troll

Not surprisingly here, we find the same characteristics for seasons 4, 2 and 6 as before regarding their proximities.

Similarities between words

We use the cosine distance measure to determine the similarities between words.

Clustering words


We decided to represent the similarities of words with a cluster dendrogram rather than a matrix, which is harder to read. The distance used for clustering is 1 − cosine similarity. As a result, ‘feel’ and ‘happy’ are really close, whereas ‘live’ and ‘baby’ are very distant from ‘night’ and ‘friend’. Indeed, if ‘baby’ refers to a child, we can guess it is not used in the same scenes as a ‘night’ with a ‘friend’.

Part 3.3.2 : Term-Topic Analysis

We want to analyze the topics of the season scripts using LSA and LDA.

As the first dimension is often linked to document length, we wanted to verify that this was the case, and indeed, dimension 1 is negatively correlated with document length.

Then we did an analysis of topics 2 and 3.

  • Topic 2 is associated positively to “baby”, “feel” and “love” and negatively with “ring”, “night” and “mother”.
  • Topic 3 is associated positively to “time”, “gablehouser” and “enter” and negatively with “ring”, “past” and “feel”.

To visually represent the relationship between topics 2 and 3, the seasons and the words, we draw an LSA-based biplot. Because of the large number of terms, the interpretation is difficult, so below we restrict the chart to the terms most related to dimensions 2 and 3.


The biplot shows that Topic 2 is associated with seasons 7,8,9 and 10 and with the words “love”, “baby”, “happy”,“guy” and anti-associated with season 3 and with the words “ring”, “friend”, “mother”. Topic 3 is associated with seasons 1 and 4 and with words “enter”, “gablehouser”, “machine” and anti-associated with “ring”, “fine”, “day”.

LDA using quanteda

We now turn to an LDA. We started with 10 topics and then eliminated the non-meaningful ones until we settled on 5 topics. These topics are related to the words below.

##      topic1        topic2    topic3   topic4     topic5  
## [1,] "page"        "past"    "play"   "baby"     "time"  
## [2,] "sister"      "enter"   "wed"    "feel"     "guy"   
## [3,] "sheldon-bot" "leave"   "hawk"   "birthday" "talk"  
## [4,] "latham"      "voice"   "space"  "flag"     "friend"
## [5,] "girlfriend"  "alright" "kripke" "kitchen"  "call"

Term-Topic Analysis

The “phi” matrix provides the probability of selecting a term given a topic. For a given topic, the largest phi values indicate the terms most associated with that topic.


Here we plot the 5 terms with the largest probability within each topic. Although these terms have the highest probability of appearing in their topic, their phi values are relatively low, so they are only slightly more common than the other terms.
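Reading the phi matrix amounts to sorting each topic’s row; a Python sketch with made-up probabilities (only a few entries of each row are shown, so they do not sum to 1):

```python
# Toy phi values: P(term | topic); in the real output these come from
# the fitted LDA model.
phi = {
    "topic4": {"baby": 0.04, "feel": 0.03, "birthday": 0.02, "time": 0.01},
    "topic5": {"time": 0.05, "guy": 0.04, "talk": 0.03, "baby": 0.01},
}

def top_terms(topic, k=3):
    """Terms with the largest phi, i.e. most associated with the topic."""
    return sorted(phi[topic], key=phi[topic].get, reverse=True)[:k]

print(top_terms("topic5"))  # → ['time', 'guy', 'talk']
```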

Topic-Document Analysis

The “theta” provides the probabilities (i.e., proportions) of the topics within each document (season).

This graph shows that Topic 5 is present at more than 50% in all seasons. Topic 1 is more related to season 4, Topic 2 to seasons 1, 2 and 3, Topic 3 to seasons 5, 6 and 8, and Topic 4 to seasons 9 and 10. There seems to be a chronological link between the topics and the seasons.

Part 3.2 : Sentiment Analysis by character

After the analysis of the script according to the seasons, we wanted to see how the five main characters (Sheldon, Leonard, Penny, Raj and Howard) of the show impact the script of the show through a sentiment analysis.

We use a data set in which each observation is one sentence said by a character. Thus, we can again use the ‘nrc’ lexicon and get an idea of the dispersion of feelings for each of our characters.



First, we see in this graph that Sheldon seems to be the most ‘intense’ character, in the sense that he is the one who uses the most words that can be categorized by a feeling. We then notice an identical pattern across all characters: a prevalence of positive, then negative, words, and conversely fewer words related to the feelings of ‘trust’ and ‘disgust’.

Since negative and positive feelings predominate in the previous analysis, we tried another dictionary: the ‘bing’ lexicon, a general-purpose English sentiment lexicon that categorizes words in a binary fashion, either positive or negative.

Negative-Positive Ratio in all seasons by using bing lexicon


We obtain a surprising result considering our previous findings: negative words represent a major share for all characters, which contradicts the results of the nrc lexicon (why?). We also notice that Sheldon is the most negative character and Penny the most positive one, which is consistent with our previous results. We can also imagine that some seasons are more or less pleasant for our characters: for example, Raj seems to use more positive words in season 9, and Leonard in season 2, while Sheldon uses more negative than positive words in seasons 1, 3 and 7.

Valence shifter approach on each character


The analysis can be done sentence by sentence. The most ‘intense’ character is Sheldon: he appears very expressive in the positive as in the negative. However, we can guess that he tends to be less and less negative towards the end of the show, as the lowest peaks become a little less frequent.


With this barplot, we can once again identify the average sentiment and see that the most positive character is Penny with an average of 0.13, followed by Leonard. The least positive character is Sheldon.

Supervised Learning

Features: DTM and LSA

First, we create the corpus from the data set “character_speech”. Within this data set, every line is coupled to one of the characters (Howard, Leonard, Penny, Raj, or Sheldon). The variable y is the character name and is the variable we want to predict, based on the lines of the script. We then create a DFM from the tokenized corpus of the characters and their corresponding speech.

Next, we train the classifier. First we combine the target variable and the LSA features in a data frame. We take a sample of 80% of this data frame as the train set; the other 20% is used as the test set. We then train the classifier with the ranger package, predict, and show the results in a confusion matrix from the caret package. The base rate (here called “No Information Rate”) is 0.2946; with an accuracy of 0.3572, the model does better than always guessing the majority class, but this accuracy remains quite low. Therefore, in the next paragraphs we look at further improving the model and its accuracy.
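The split and the baseline can be sketched as follows. This toy Python snippet (the real model is a ranger random forest on LSA features, in R) only shows the 80/20 split and the “No Information Rate”, i.e. the accuracy of always predicting the majority class:

```python
import random
from collections import Counter

random.seed(42)

# Toy (line, character) pairs standing in for the feature rows and
# the target variable y.
data = ([("i love you", "Leonard")] * 30
        + [("that is my spot", "Sheldon")] * 50
        + [("oh sweetie", "Penny")] * 20)

random.shuffle(data)
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

# "No Information Rate": accuracy of always predicting the majority
# class of the training set; any useful classifier must beat it.
majority = Counter(y for _, y in train).most_common(1)[0][0]
nir = sum(y == majority for _, y in test) / len(test)
print(majority, round(nir, 2))
```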

Improving the features:

First, we transform the DFM to LSA, as we did in the previous paragraph. However, now we try to see for which number of dimensions the model gives the highest accuracy. A maximum number of 1000 dimensions is chosen, as with these dimensions the run time is already very long and the accuracy does not seem to increase significantly after 1000 dimensions.

The accuracies for 2, 5, 25, 50, 100, 500 and 1000 dimensions are respectively 0.2597865, 0.3128437, 0.3303138, 0.3377548, 0.3513426, 0.3581365, and 0.3568424. Given the long run time and the fact that the accuracy curve is flattening, we choose 100 dimensions (nd = 100) for the DFM and LSA, as this gives a relatively high accuracy while keeping the run time manageable.

Second, we try to further improve the model by first transforming the DFM into a TF-IDF. As in the paragraph above, we again look for the number of dimensions at which the model gives the best accuracy. The resulting accuracies are 0.2761242, 0.3023293, 0.3485927, 0.3489162, 0.3558719, 0.3587836, and 0.3536072 for respectively 2, 5, 25, 50, 100, 500 and 1000 dimensions.

After running several scenarios, we again choose 100 dimensions, as more dimensions increase the run time immensely while the gain in accuracy is minimal. Furthermore, we choose the tf-idf, as it outperforms the plain dfm by a small margin: 0.3558719 versus 0.3513426.

We now rerun the model with the chosen dimensions, so we can further improve on the accuracy with word embedding.

Word Embedding with glove

We choose 100 iterations: the loss does keep decreasing with more iterations, but the run time becomes too long compared to the improvement per additional iteration, so we make the somewhat arbitrary decision to stop at 100. Increasing the rank from 25 to 50 (with 100 iterations) decreases the loss from 0.0202 to 0.0051, with accuracies of 0.3569 and 0.3692 respectively. A rank of 100 gives a loss of zero and an accuracy of 0.3614; it is thus interesting to see that the accuracy decreases at rank 100 even though the loss decreases. As a rank of 50 returns the highest accuracy among the values tried, we use it in our further analysis.

A window of 1 gives a loss of 0.0052 and an accuracy of 0.3692 with a rank of 50. As the paragraph above has shown, a lower loss does not mean a higher accuracy, so we compute both the loss and the accuracy for each window. Increasing the window to 5 gives a loss of 0.0354 and an accuracy of 0.365; decreasing it to 3 gives a loss of 0.0253 and an accuracy of 0.3687; a window of 2 gives a loss of 0.0168 and an accuracy of 0.3685. Again, the differences in accuracy are small, but as a window of 1 results in the highest accuracy, we chose it as our base for the further improvements with GloVe.

We tried another setting where we add the length of the sentences as a feature. However, it decreases the accuracy to 0.3596. The explanation might be that each character has a similar average line length, as we are working with a very large corpus.

Also, adding the centers decreases the accuracy, to 0.3621. Thus, combining the centers from the GloVe model with the tf-idf brings no further improvement over the GloVe model by itself.

Out of curiosity and interest, we also tried to see how this accuracy would materialize in practice. The idea was to come up with a random sentence and see how the model would classify it. Sadly due to time constraint, we were not able to figure out the code to successfully predict a character name based on a random sentence. The code we tried is accessible within our files.

Concluding from the code run above, using the GloVe model by itself gives the highest accuracy. Two things should be noted, however: the accuracy does not change much across the different methods, and although it is higher than the base rate, the improvement is not large. An explanation might be that these are fictional characters who differ in personality but not much in vocabulary, as seen in the Exploratory Data Analysis with the type-token ratio graph. In other words, as most scenes in the series are conversations between the characters, similar topics and words are discussed by all of them, so the same distinctive words are used by multiple characters. Also, since Penny is perceived as the less intelligent character of the series, we would have expected the models to predict Penny rather well; however, the confusion matrices show that this is not the case.

Conclusion

Given the context of this series, we could have expected a much more scientific vocabulary. But we realized that it largely follows the codes of American sitcoms, which put forward the relationships between the characters, the dramas, and the typical subjects that we can expect from people of their age.

The second half of the show has a more positive sentiment score, except for season 7. We can suppose that the characters tend to become more positive, and maybe less sarcastic, as the series goes on.

Concerning the characters, Sheldon is the star of the show and this is felt in the analysis. He is always the one who talks the most and is the most intense.


Limits:

Some limits we encountered could be:

Future work:

As future work, we might want to improve the machine learning part. Indeed, as already explained, the accuracy is not very high, so we could try new techniques to improve it, or even try other machine learning models. Another avenue would be to finish the part we started on predicting a character from a given sentence.

Reference

@NRC @AFINN @Stemming_lemma